JudgeLM: Fine-tuned Large Language Models are Scalable Judges

JudgeLM achieves over 90% agreement with its teacher judge, a level that exceeds even human-to-human agreement.
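
To make the agreement figure concrete, the following is a minimal sketch of how such a rate could be computed. It assumes that agreement means the fraction of evaluated samples on which the fine-tuned judge and the teacher judge return the same verdict. The function name `agreement_rate`, the verdict labels, and the toy verdict lists are hypothetical illustrations, not the paper's code or data.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], teacher: Sequence[str]) -> float:
    """Fraction of samples where both judges give the same verdict.

    Assumption: each verdict is a label such as "A", "B", or "tie"
    for a pairwise comparison of two candidate answers.
    """
    if len(judge) != len(teacher):
        raise ValueError("verdict lists must have the same length")
    if not judge:
        raise ValueError("verdict lists must not be empty")
    matches = sum(j == t for j, t in zip(judge, teacher))
    return matches / len(judge)


# Toy verdicts, invented for illustration only (not results from the paper).
judge_verdicts = ["A", "B", "tie", "A", "B"]
teacher_verdicts = ["A", "B", "A", "A", "B"]

rate = agreement_rate(judge_verdicts, teacher_verdicts)
print(f"agreement: {rate:.0%}")
```

Under this assumed definition, a reported agreement above 90% would mean the two judges give the same verdict on more than nine out of ten samples.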